Buying and selling used phones and tablets used to be something that happened on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market would be worth $52.7bn by 2023 with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets that offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives for both consumers and businesses that are looking to save money on a purchase. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors and platforms, such as Verizon and Amazon, provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.
brand_name: Name of manufacturing brand
os: OS on which the device runs
screen_size: Size of the screen in cm
4g: Whether 4G is available or not
5g: Whether 5G is available or not
main_camera_mp: Resolution of the rear camera in megapixels
selfie_camera_mp: Resolution of the front camera in megapixels
int_memory: Amount of internal memory (ROM) in GB
ram: Amount of RAM in GB
battery: Energy capacity of the device battery in mAh
weight: Weight of the device in grams
release_year: Year when the device model was released
days_used: Number of days the used/refurbished device has been used
normalized_new_price: Normalized price of a new device of the same model in euros
normalized_used_price: Normalized price of the used/refurbished device in euros
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression_model
import statsmodels.api as sm
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
used = pd.read_csv("used_device_data.csv")
used.info()
We have 9 float variables, 4 object variables, and 2 integer variables. In other words, the data is composed of 11 numerical variables and 4 categorical ones.
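The dtype tally above can also be obtained programmatically instead of being read off the info() output. The following is a minimal sketch on a hypothetical three-row stand-in table (the `sample` frame and its values are made up for illustration, not taken from used_device_data.csv); the same calls would apply to `used`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the used-device table (not the real data).
sample = pd.DataFrame(
    {
        "brand_name": ["Nokia", "Apple", "Samsung"],
        "os": ["Others", "iOS", "Android"],
        "screen_size": [6.1, 15.3, 16.0],
        "release_year": [2013, 2019, 2020],
    }
)

# Number of columns per dtype, the same information info() summarizes.
dtype_counts = sample.dtypes.value_counts()

# Split the column names into numerical and categorical groups.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
```

Selecting by "number" rather than by a specific dtype keeps float and integer columns together, which matches the numerical/categorical split described above.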
used.describe(include="all").T
!pip install pandas-profiling==3.0.0 ##Installation of the pandas profiling library.
from pandas_profiling import ProfileReport #Import the tool
report = ProfileReport(used, title = "Data exploration") #Report generation
report.to_notebook_iframe() #Render the report inline in the notebook
report.to_file("Report.html") #Generate an HTML file for the report
used.groupby("ram")["brand_name"].value_counts()
#create a bivariate histogram of battery capacity by brand
sns.displot(data=used,x="brand_name",y="battery",kind="hist",height=5,aspect=20)
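#screen_size is recorded in cm, so this counts screens larger than 6 cm; a 6-inch threshold would be 6 * 2.54 = 15.24 cm
used[used["screen_size"]>6.0].count()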
sns.jointplot(data=used,x="brand_name",y="selfie_camera_mp")
Questions:
It looks like a normal distribution with no noticeable skew.
About 93% of the devices in the data run Android.
Nokia devices tend to have less RAM, while Motorola and Huawei devices tend to have more.
In general, every brand has extreme values for battery size, but the brands with the largest batteries overall are Apple, Google and Samsung.
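About 3362 devices pass the screen_size > 6.0 filter, which is the vast majority of the data set. Because screen_size is recorded in cm, that filter counts screens larger than 6 cm rather than 6 inches (15.24 cm), so the number of devices with screens larger than 6 inches can be no higher than this figure.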
The distribution of selfie camera resolution across brands is heavily right-skewed (most values sit at the low end), so we can at least infer that it does not follow a normal distribution. The number of devices with selfie cameras above 8 MP is rather small compared with the number with lower-resolution cameras.
The normalized price of a new device, the screen size, the selfie and main camera resolution (MP), and the battery.
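Several of the answers above can be checked numerically rather than read off plots. The sketch below is one possible way to do so, run on a hypothetical five-row table (the `sample` values are invented for illustration); `CM_PER_INCH` and the variable names are assumptions of this example, not part of the original notebook.

```python
import pandas as pd

CM_PER_INCH = 2.54

# Hypothetical five-row stand-in for the device table (not the real data).
sample = pd.DataFrame(
    {
        "os": ["Android", "Android", "iOS", "Android", "Others"],
        "screen_size": [10.2, 15.5, 15.3, 25.4, 6.5],  # in cm
        "selfie_camera_mp": [2.0, 5.0, 8.0, 32.0, 0.3],
    }
)

# Share of devices per operating system, as fractions that sum to 1.
os_share = sample["os"].value_counts(normalize=True)

# Devices with screens larger than 6 inches: convert the threshold to cm first.
over_six_inches = int((sample["screen_size"] > 6 * CM_PER_INCH).sum())

# Sample skewness: positive means a long right tail, negative a long left tail.
selfie_skew = sample["selfie_camera_mp"].skew()

print(os_share)
print("screens over 6 inches:", over_six_inches)
print("selfie camera skewness:", selfie_skew)
```

Applied to `used`, the same three expressions would give the Android share, the count of screens above 15.24 cm, and the direction of the skew in the selfie camera resolution.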
used.isnull().any()
There are missing data in some rows. Let's identify how many of them.
used.isnull().values.any()
used.isnull().sum()
Because the total number of missing values is less than 10% of the observations, we can drop the affected rows.
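Before dropping rows, it can help to confirm how much data is actually lost, since missing values spread across different columns can remove more rows than any single column suggests. This is a minimal sketch on a hypothetical four-row table (values invented for illustration); the same expressions would apply to `used`.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in with two incomplete rows (not the real data).
sample = pd.DataFrame(
    {
        "main_camera_mp": [13.0, np.nan, 8.0, 5.0],
        "battery": [3000.0, 4000.0, np.nan, 2500.0],
        "weight": [150.0, 180.0, 160.0, 140.0],
    }
)

# Fraction of missing cells in each column.
missing_by_column = sample.isnull().mean()

# Fraction of rows that dropna(axis=0) would remove (rows with any missing cell).
rows_dropped = sample.isnull().any(axis=1).mean()

cleaned = sample.dropna(axis=0)

print(missing_by_column)
print("share of rows dropped:", rows_dropped)
print("rows kept:", len(cleaned))
```

In this toy table each column is at most 25% missing, yet half of the rows are removed, which is why the row-level share is the number to compare against the 10% threshold.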
df = used.dropna(axis=0)
df.isnull().values.any()
df.describe().T
df.duplicated().sum()
There are neither duplicated rows nor missing values. Now, we will do a new exploratory data analysis and check for outliers before building our model.
reportII = ProfileReport(df, title = "Data exploration II") #Report generation
reportII.to_notebook_iframe() #Render the report inline in the notebook
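The outlier check mentioned above could be sketched with the common 1.5 × IQR rule. This rule is an assumption of the example, not necessarily the method used later in the notebook, and the helper `iqr_outlier_counts` and the `sample` values are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Boolean frame marking values below the lower or above the upper fence.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Hypothetical five-row stand-in (not the real data); one battery value is extreme.
sample = pd.DataFrame(
    {
        "os": ["Android", "Android", "iOS", "Android", "Others"],
        "battery": [3000.0, 3100.0, 2900.0, 3050.0, 9000.0],
        "weight": [150.0, 155.0, 148.0, 152.0, 151.0],
    }
)

counts = iqr_outlier_counts(sample)
print(counts)
```

Calling `iqr_outlier_counts(df)` on the cleaned table would give a per-column count to compare against the box plots in the profiling report.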